[CI/Build] Basic server correctness test #237
Conversation
This test is failing today. Something's been broken over the weekend. The exception is:
I don't understand why the build was skipped. I didn't try to skip it.
A couple of notes:
Force-pushed from b00a664 to ba3866a
After rebasing this branch onto main, the test is passing for me with the single Mistral model:
Per Slack discussions, I've updated the test to include most of the remaining models in the test execution (some need to be skipped if the model requires a GPU device capability greater than that available on the GPU under test). It was also necessary to ignore "special tokens" output by the HuggingFace runner for a few prompts in a number of models. Simply converting any special token to an empty string worked for all but one test:
The failure is the same for both executions with the same model:
The HuggingFace response in this case without the hack had this error:
So, it's not really related to the special tokens.
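For reference, the special-token hack amounts to something like this minimal sketch (assuming a `transformers`-style tokenizer; the helper name is illustrative, not the actual code in the test):

```python
def scrub_special_tokens(tokens: list[str], tokenizer) -> list[str]:
    """Replace any special token (e.g. <s>, </s>, <unk>) with an empty string
    so the HuggingFace output can be compared fairly with the server output."""
    special = set(tokenizer.all_special_tokens)
    return ["" if tok in special else tok for tok in tokens]
```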
Introducing an end-to-end test case that verifies basic correctness of the vllm server by comparing the tokens output by the vllm OpenAI server with tokens generated by the HuggingFace model created with `AutoModelForCausalLM.from_pretrained()`. Updates `HfRunner()` to accept a HuggingFace access token so that restricted-access models can be retrieved. The new `HfRunnerNM.generate_greedy_logprobs_nm_use_tokens()` allows us to compare the HuggingFace-generated results (which report logprobs with token ids) with those from the vllm OpenAI server (which reports logprobs with token text). This included a new `_decode_token_by_position_index()` method to properly calculate the token string by using a lookback on the generated tokens list. Enhances the output of the `check_logprobs_close()` function to provide more details about the failing tokens. Adds the test to the appropriate `skip-*.txt` files so that this long-running test won't be run automatically during dev push workflows.
Test other models. Skip execution if the model requires a GPU device capability greater than that available on the current device (reusing the approach from test_gptq_marlin.py). Adds a hack to ignore special tokens after decoding the HuggingFace response so that we can fairly compare it with the vllm server response.
This model fails the test with a specific prompt; to be addressed later.
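The GPU-capability gate mentioned above (reusing the approach from `test_gptq_marlin.py`) boils down to a check roughly like the sketch below; the capability table here is a made-up placeholder, not the real list from the test:

```python
import pytest
import torch

# Illustrative minimum compute capability (major, minor) per model;
# the real values live in the test itself.
MIN_CAPABILITY = {
    "example/marlin-quantized-model": (8, 0),  # hypothetical: Marlin kernels want Ampere+
}

def maybe_skip_for_capability(model_name: str) -> None:
    """Skip the test when the current GPU is older than the model requires."""
    required = MIN_CAPABILITY.get(model_name)
    if required is not None and torch.cuda.get_device_capability() < required:
        pytest.skip(f"{model_name} needs compute capability >= {required}, "
                    f"got {torch.cuda.get_device_capability()}")
```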
Force-pushed from a9451b9 to 67acb7f
I've rebased this to the latest nm-vllm/main. At this point, the test includes a number of models, but skips a few that don't work with HuggingFace out of the box, and one that fails the test for a specific prompt. I've got Asana tickets to address these later, so that we can get this committed and running now.
cool.
@derekk-nm could you add a README in "neuralmagic" or "neuralmagic/tests" that outlines:
thanks
Entries have been moved to the bug report, where failing models will be tracked. Removed some additional models that do not work in the build/test env (until a resolution is found). Expanded the doc on the test case. Added a README for the *_skip.txt files.
adding tests/basic_correctness/test_basic_server_correctness.py to skip-for-remote-push-tmp.txt
Introducing an end-to-end test case that verifies basic correctness of the vllm server by comparing the tokens output by the vllm OpenAI server with tokens generated by the HuggingFace model created with `AutoModelForCausalLM.from_pretrained()`.

Updates `HfRunner()` to accept a HuggingFace access token so that restricted-access models can be retrieved.
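As a rough illustration of what accepting an access token involves (a sketch only, not the actual `HfRunner()` change; the `token=` keyword assumes a recent `transformers` release, older versions spell it `use_auth_token=`):

```python
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_restricted_model(model_name: str):
    """Load a gated/restricted model using an access token from the environment."""
    access_token = os.getenv("HF_TOKEN")
    model = AutoModelForCausalLM.from_pretrained(model_name, token=access_token)
    tokenizer = AutoTokenizer.from_pretrained(model_name, token=access_token)
    return model, tokenizer
```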
The new `HfRunnerNM.generate_greedy_logprobs_nm_use_tokens()` allows us to compare the HuggingFace-generated results (which report logprobs with token ids) with those from the vllm OpenAI server (which reports logprobs with token text). This included a new `_decode_token_by_position_index()` method to properly calculate the token string by using a lookback on the generated tokens list.

Enhances the output of the `check_logprobs_close()` function to provide more details about the failing tokens.

Adds the test to the appropriate `skip-*.txt` files so that this long-running test won't be run automatically during dev push workflows.
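For context, the lookback idea behind `_decode_token_by_position_index()` works roughly like the sketch below (my own approximation, not the method's actual implementation):

```python
def decode_token_by_position_index(token_ids: list[int], index: int,
                                   tokenizer, lookback: int = 4) -> str:
    """Recover the text of the token at `index` by decoding a short window of
    preceding tokens and taking the difference, so tokenizers that fold
    whitespace into neighbouring tokens still produce the right string."""
    start = max(0, index - lookback)
    with_token = tokenizer.decode(token_ids[start:index + 1])
    without_token = tokenizer.decode(token_ids[start:index])
    return with_token[len(without_token):]
```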
To run this test manually (this assumes that you've downloaded and installed the local nm-vllm package with `pip install -e .[sparse]` and all of the packages from `requirements-common.txt`, `requirements-cuda.txt`, and `requirements-dev.txt`):

- Set the `HF_TOKEN` environment variable to a valid HuggingFace access token.
- From the `nm-vllm` directory, run:
  `python3 -m pytest --forked tests/basic_correctness/test_basic_server_correctness.py -k test_models_on_server`

[Note that when running this from my local env I needed to include the `--import-mode importlib` option to work around a known issue in vllm.]